We welcome this critique of simplistic one-dimensional
measures of academic performance, in particular
the naive use of impact factors and the h-index,
and we can only extend sympathy to colleagues who 
are being judged using some of the techniques de- 
scribed in the paper. In particular we welcome the 
report's emphasis on the need for careful modeling of 
citation data rather than relying on simple summary 
statistics. Our own work on league tables adopts a 
modeling approach that seeks to understand the fac- 
tors associated with institutional performance and 
at the same time to quantify the statistical uncer- 
tainty that surrounds institutional rankings or fu- 
ture predictions of performance. In the present com- 
mentary we extend this approach to an analysis of 
the 2008 UK Research Assessment Exercise (RAE) 
for Universities. 

Before we describe our analysis we should
comment on an important modeling problem that
arises in the analysis of citation data, alluded to but 
not discussed in detail in the report, nor, as far as 
we know, elsewhere. A principal difficulty with in- 
dices such as the h-index or simple citation counts is 
that there are inevitable dependencies between indi- 
vidual scientists' values. This is because a citation is 
to a paper with, in general, several authors, rather 
than to each specific author. Thus, for example, if 
two authors nearly always write all their papers to- 
gether, they will tend to have very similar values. If 
they belong to the same university department then 
their scores do not supply independent bits of information
in compiling an overall score or rank for
that department. Currently this issue is recognized 
in the RAE, albeit imperfectly, by the requirement 
that the same paper cannot be entered more than 
once by different authors for a given university de- 
partment. In a citation based system this would also 
need to be recognized. 

In addition, if our two authors were in different, 
competing departments, we would also need to rec- 
ognize this since the dependency would affect the 
accuracy of any comparisons we make. We also note
that this dependency will, to some extent, affect the analyses
that we present below, and it can be expected to lead us to
overstate the accuracy of our rankings. Unfortunately
we have no data that would allow us to estimate,
even approximately, how important this is. To
deal with this problem satisfactorily would involve 
a model that incorporated "effects" for each author 
and the detailed information about the authorship 
of each paper that was cited. Goldstein (2003, Chap- 
ter 12.5) describes a multilevel "multiple member- 
ship" model that can be used for this purpose, where 
individual authors become level 2 units and papers 
are level 1 units. 
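
As a rough sketch of what such a model could look like (the notation below is ours and is an assumption, not taken from the report or quoted from Goldstein, 2003), write $y_i$ for a suitably transformed citation measure for paper $i$, $A(i)$ for its set of authors, $u_j$ for the effect of author $j$, and $w_{ij}$ for membership weights, for example $1/|A(i)|$:

```latex
y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}
      + \sum_{j \in A(i)} w_{ij}\, u_j + e_i,
\qquad \sum_{j \in A(i)} w_{ij} = 1,
\qquad u_j \sim N(0, \sigma_u^2), \quad e_i \sim N(0, \sigma_e^2).
```

Two authors who share most of their papers then contribute to largely the same observations, so the dependency between their scores is represented in the model rather than ignored.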

The results of the UK Research Assessment Exercise were published
on 18th December 2008, covering the years
2001-2008. In total, 52,409 staff from 159 institutions were
grouped into 67 "units of assessment" (UOA): up 
to 4 publications for each individual were considered 
as well as other activities and markers of esteem. 
Panels drawn from around 1000 peer reviewers then 
produced a "quality profile" for each group, summa- 
rizing in blocks of 5% the proportion of each submis- 
sion judged by the panels to have met each of the 
following quality levels: "world-leading" (4*), "in- 
ternationally excellent" (3*), "internationally recog- 
nized" (2*), "nationally recognized" (1*), and "un- 
classified." This procedure is notable in terms of its 
use of peer judgment rather than simple metrics, 
and allowing a distribution of performance rather 
than a single measure. All the data is available for 
downloading (Research Assessment Exercise, 2008). 
Institution                     Staff (FTE)
Oxford                          24.50
Cambridge                       16.00
Imperial College London         13.90
Warwick                         24.00
Bristol                         23.00
Nottingham                       9.00
Leeds                           11.00
Kent                            12.00
Southampton                     28.00
Bath                            15.00
Lancaster                       21.65
St Andrews                       7.00
Sheffield                       10.70
Newcastle upon Tyne             13.00
Manchester                      10.90
Glasgow                         13.00
Open University                  7.00
London School of Economics      13.00
Brunel                          10.00
University College London       13.50
Durham                          11.60
Edinburgh and Heriot-Watt       30.00
Strathclyde                     10.33
Queen Mary London                8.20
Reading                          7.70
Salford                          9.80
Liverpool                        5.00
Greenwich                        2.00
Plymouth                         4.00
London Metropolitan              4.00

Fig. 1. "Quality profiles" for 30 groups under UOA22 "Statistics and Operational Research": UK Research Assessment
Exercise 2008, ranked according to mean score: numbers of staff taken into account are shown.



Figure 1 shows the results relevant for most statis- 
ticians: the 30 groups entered under UOA22: "Statis- 
tics and Operational Research." These have been 
ordered into a league table using the average num- 
ber of stars which we shall term the "mean score," 
which is the procedure adopted by the media. Also 
reported is the number of full-time equivalent staff 
in the submission. Controversy surrounds this num- 
ber as it is unknown how selective institutions were 
in submitting staff — it was originally intended that 
the total pool of staff would also be reported but 
late in the day there were objections raised as to 
the definitions of eligibility and this requirement was 
dropped. 

The financial consequences of this whole exercise 
concern the distribution of around £1.5 billion of fu- 
ture funding. After publication of the quality profiles 
it was revealed that for funding purposes 4*, 3*, 2*, 1* 
outputs would be weighted in proportion to 7, 3, 1, 0:
in further analysis we consider the "mean funding
score" defined as $7p_4 + 3p_3 + p_2$, where $p_i$ is the proportion
of outputs given $i$ stars.
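
The mean funding score is simply a weighted sum over the quality profile. A minimal sketch follows; the function name and the example profile are hypothetical, and the zero weight for "unclassified" outputs is our assumption (the stated weights cover only 4* to 1*).

```python
# Funding weights for 4*, 3*, 2*, 1* outputs as stated in the text.
# The weight of 0 for unclassified outputs (level 0) is an assumption.
WEIGHTS = {4: 7, 3: 3, 2: 1, 1: 0, 0: 0}


def mean_funding_score(profile):
    """Return 7*p4 + 3*p3 + p2 for a profile mapping star level -> proportion."""
    total = sum(profile.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("profile proportions must sum to 1")
    return sum(WEIGHTS[stars] * p for stars, p in profile.items())


# Hypothetical profile in 5% blocks; not taken from Figure 1.
example_profile = {4: 0.20, 3: 0.40, 2: 0.30, 1: 0.10, 0: 0.00}
example_score = mean_funding_score(example_profile)
```

Any other weighting can be examined by changing `WEIGHTS`; the plain "mean score" used by the media corresponds to weights 4, 3, 2, 1, 0.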

In their report, Adler and colleagues argue that 
statistical analysis of performance data requires some 
concept of a model, and the provision of a qual- 
ity profile rather than just a single number suggests 
it could be used for this purpose. We might first 
view the quality profile as representing the sampling 
distribution of material arising from each group, in 
fact a single Multinomial observation with probability
$(p_4, p_3, p_2, p_1)$: if, in the spirit of a bootstrap,
we simulate from these distributions and rank the 
institutions at each iteration, we can produce a dis- 
tribution for the predicted rank of a random future 
output from each group as shown in Figure 2. 
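
The simulation just described might be sketched as follows. This is an illustration under stated assumptions rather than the code behind Figure 2: the function names are hypothetical, each profile is given as proportions for 4*, 3*, 2*, 1* and unclassified, and tied groups are given the average of the ranks they span (the text does not say how ties were handled).

```python
import random
from statistics import median

STARS = [4, 3, 2, 1, 0]  # 4*, 3*, 2*, 1*, unclassified


def rank_with_ties(values):
    """Rank 1 is best; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def simulate_output_ranks(profiles, iterations=10_000, seed=0):
    """Draw one output per group per iteration and record each group's rank."""
    rng = random.Random(seed)
    names = list(profiles)
    ranks = {name: [] for name in names}
    for _ in range(iterations):
        drawn = [rng.choices(STARS, weights=profiles[name])[0] for name in names]
        for name, r in zip(names, rank_with_ties(drawn)):
            ranks[name].append(r)
    return ranks


def summarize(sample):
    """Return (2.5% point, median, 97.5% point) of a simulated rank sample."""
    s = sorted(sample)
    n = len(s)
    return s[int(0.025 * (n - 1))], median(s), s[int(0.975 * (n - 1))]
```

Because a single output can take only five values, many groups tie at each iteration, which is what makes the resulting rank distributions lumpy.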

We note the substantial overlap of the distribu- 
tions: in fact the rank distributions are highly mul- 
timodal due to the extreme number of ties at each 
iteration, which explains the somewhat anomalous 
results for some groups in which the median rank 
order is substantially different from the mean-score
order in which the institutions are plotted.

Fig. 2. Predicted rank of a future output from each group: median and 95% intervals are shown based on 10,000 iterations.

We are not, however, particularly interested in a 
single output and instead we may want to focus on 
the accuracy with which a summary parameter, such 
as the underlying mean funding score, is known: we 
treat this as an illustration of a general technique 
for analyzing any summary measure arising from a 
specified weighting. It then seems reasonable to take 
into account the quantity of information underly- 
ing the quality profile: each individual contributes 4 
publications and the publications count for 70% of 
the quality profile, and so we shall take a rough "ef- 
fective sample size" as 6 outputs per staff member. 
Note that this does not mean that we are treating 
the publications as being a random sample from a 
larger population, but as relevant information con- 
nected through a probability model with some un- 
derlying parameter which may, in our particular il- 
lustration, be interpreted as the expected funding 
score of future outputs. 

It would be possible to convert to ordered categor- 
ical data by multiplying the quality profile for each 
group by the number of publications taken into ac- 
count (6 times the number of staff). Here, for the
sake of simplicity, we have assumed a normal sampling
distribution, estimating the standard error of
the mean funding score as the square root of the
ratio of the sample variance of the profile to 6 times
the number of staff.
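
A minimal sketch of this standard error calculation is given below. The function name and the example inputs are hypothetical, and we assume that the "sample variance of the profile" is the variance of the funding weight under the profile proportions, without any small-sample correction.

```python
from math import sqrt

FUNDING_WEIGHTS = [7, 3, 1, 0, 0]  # 4*, 3*, 2*, 1*, unclassified (last weight assumed)
OUTPUTS_PER_STAFF = 6              # rough effective sample size per staff member


def funding_score_and_se(profile, staff):
    """Return (mean funding score, standard error) for one group.

    profile: proportions for 4*, 3*, 2*, 1*, unclassified, summing to 1.
    staff:   number of full-time equivalent staff submitted.
    """
    mean = sum(w * p for w, p in zip(FUNDING_WEIGHTS, profile))
    variance = sum(p * (w - mean) ** 2 for w, p in zip(FUNDING_WEIGHTS, profile))
    effective_n = OUTPUTS_PER_STAFF * staff
    return mean, sqrt(variance / effective_n)


# Hypothetical group: profile in 5% blocks, 10 staff submitted.
example_mean, example_se = funding_score_and_se([0.20, 0.40, 0.30, 0.10, 0.00], staff=10.0)
example_interval = (example_mean - 1.96 * example_se, example_mean + 1.96 * example_se)
```

Halving the number of staff widens the interval by a factor of about 1.4, which is why the smaller groups carry the widest intervals.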

Figure 3a shows the resulting estimates and 95% 
intervals for the mean scores. Treating these as nor- 
mal distributions we can simulate future mean scores, 
rank at each iteration, and form a distribution for 
the "true" rank of each group. These are summa- 
rized in Figure 3b. 
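
The ranking step can be sketched as below, assuming each group is summarized by a hypothetical pair (mean funding score, standard error). The function names are ours, and this is an illustration of the idea rather than the code used to produce Figure 3b.

```python
import random


def simulate_true_ranks(estimates, iterations=10_000, seed=0):
    """estimates maps group name -> (mean funding score, standard error).

    At each iteration a score is drawn for every group from its normal
    distribution and the groups are ranked, with rank 1 the highest score.
    """
    rng = random.Random(seed)
    names = list(estimates)
    ranks = {name: [] for name in names}
    for _ in range(iterations):
        draws = {name: rng.gauss(m, se) for name, (m, se) in estimates.items()}
        ordered = sorted(names, key=lambda name: draws[name], reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    return ranks


def rank_interval(sample, level=0.95):
    """Return (lower limit, median, upper limit) of a simulated rank sample."""
    s = sorted(sample)
    lower = s[int((1 - level) / 2 * (len(s) - 1))]
    upper = s[int((1 + level) / 2 * (len(s) - 1))]
    return lower, s[len(s) // 2], upper
```

A group whose interval from `rank_interval` lies wholly within ranks 1 to 15 of 30 can be placed in the top half with some confidence; an interval that straddles the midpoint cannot.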

We see that for 14 out of 30 groups the 95% inter- 
val for the mean funding score overlaps the overall 
mean for all groups. Correspondingly we can identify 
14 groups for which the 95% interval for their "true" 
rank, based on their mean funding scores, lies in ei- 
ther the top or bottom half. Both the mean funding 
scores and ranks, particularly for the smaller institu- 
tions, are associated with considerable uncertainty 
and this should warn against over-interpretation of 
either. If desired this could provide a basis for al- 
location into one of three groups for resource allo- 
cation purposes, although we would not necessarily 
recommend such a procedure. 
Fig. 3. (Left) Estimates and intervals for expected funding score of outputs from each group. (Right) Summary of distribution
of ranks of expected scores. Median and 95% intervals are shown based on 10,000 iterations. 



We could, in principle, take this analysis further 
by noting that if we are really interested in pre- 
dicting future performance, then we should be tak- 
ing into account the possibility of regression-to-the- 
mean, recognizing the variation within each institu- 
tion that would be expected over time. We could 
do this by fitting a hierarchical/multilevel model 
where conditioning takes place on the current scores 
(see Goldstein and Leckie, 2008, for an example us- 
ing school league tables). We could adjust for back- 
ground factors such as available resources in order to 
reduce the within-institution variability and to help 
satisfy relevant exchangeability assumptions, and so 
produce an "adjusted" institution effect. Whether 
we use this adjusted effect, or the fitted mean, as 
a basis for comparing groups would depend on the 
purpose: if we were university administrators wanting
to know whether a group had done well given
the resources available, then we would examine the
adjusted effect. If, however, we wished to use the
current scores simply to allocate income, then the 
fitted mean would be appropriate: see Goldstein and 
Leckie (2008) for a close examination of the poten- 
tial role for different kinds of adjustments when com- 
paring schools. In practice it is likely that such an 
analysis would be considered too complex. 

In conclusion, we agree with the Report's stric- 
tures on the meaning of citation counts and would 
go further and argue that citations form a rather 
bizarre measure of research performance, as if the 
sole purpose of research was to provide material 
for other researchers. If they are to be used, we 
would argue that they be analyzed within a sta- 
tistical modeling framework that fully incorporates 
uncertainty and dependency. As we have shown, for 
example, in Figure 3b, this could help to guide funding
decisions by avoiding fine distinctions that may
reflect little more than random noise. But citations 
alone, no matter how carefully analyzed, can only 
provide one measure of performance, and we feel 
strongly that they should be part of a broader profile
that takes into account other measures of real-world
impact and is assessed using peer judgment
rather than mechanistic and spuriously "objective" 
processes. 